Communication-Optimal Distributed Clustering
Abstract
Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting, in which all the data resides on a single server. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n · s, while in the broadcast model it is n + s. We implement our algorithms and demonstrate this phenomenon on real-life datasets, showing that our algorithms are also very efficient in practice.
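To give a feel for the gap between the two bounds quoted in the abstract, the following minimal sketch evaluates n · s and n + s for a couple of input sizes. It is only an arithmetic illustration, not the paper's protocols: the function names are hypothetical, constants and lower-order factors are ignored, and the unit of communication is left unspecified because the abstract does not state it.

```python
def point_to_point_cost(n: int, s: int) -> int:
    """Rough worst-case cost in the point-to-point model: n * s.

    Assumption: constants and lower-order factors are dropped.
    """
    return n * s


def broadcast_cost(n: int, s: int) -> int:
    """Rough worst-case cost in the broadcast model: n + s.

    Assumption: constants and lower-order factors are dropped.
    """
    return n + s


if __name__ == "__main__":
    # Illustrative sizes only; these are not taken from the paper's experiments.
    for n, s in [(10_000, 10), (1_000_000, 100)]:
        p2p = point_to_point_cost(n, s)
        bc = broadcast_cost(n, s)
        print(f"n={n}, s={s}: point-to-point ~{p2p}, "
              f"broadcast ~{bc}, ratio ~{p2p / bc:.1f}")
```

Under these simplifications, the ratio between the two costs approaches s as n grows, which is the sense in which a broadcast channel helps more as the number of servers increases.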
Similar resources
General and Robust Communication-Efficient Algorithms for Distributed Clustering
As datasets become larger and more distributed, algorithms for distributed clustering have become more and more important. In this work, we present a general framework for designing distributed clustering algorithms that are robust to outliers. Using our framework, we give a distributed approximation algorithm for k-means, k-median, or generally any ℓ_p objective, with z outliers and/or balance ...
Clustering using a coarse-grained parallel Genetic Algorithm: A Preliminary
Genetic Algorithms (GAs) are useful in solving complex optimization problems. By posing pattern clustering as an optimization problem, GAs can be used to obtain optimal minimum squared-error partitions. In order to improve the total execution time, a distributed algorithm has been developed using the divide-and-conquer approach. Using a standard communication library called PVM, the distribut...
A Comparative Study of Issues in Big Data Clustering Algorithm with Constraint Based Genetic Algorithm for Associative Clustering
Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge size of databases that is common nowadays. The task of extracting knowledge from large databases, in the form of clustering rules, has attracted considerable attention. Distri...
An Adaptive LEACH-based Clustering Algorithm for Wireless Sensor Networks
LEACH is the most popular clustering algorithm in Wireless Sensor Networks (WSNs). However, it has two main drawbacks: random selection of cluster heads, and direct communication of cluster heads with the sink. This paper aims to introduce a new centralized cluster-based routing protocol named LEACH-AEC (LEACH with Adaptive Energy Consumption), which guarantees to generate balanced cl...
A Fuzzy Clustering Method to Minimize the Inter Task Communication Effect for Optimal Utilization of Processor's Capacity in Distributed Real Time Systems
A distributed processing system is a collection of heterogeneous processors that requires systematic assignment of a set of “m” tasks T = {t1, t2, ..., tm} of a program to a set of “n” processors P = {p1, p2, ..., pn} (where m >> n) to achieve efficient utilization of the available processors' capacity. If this step is not performed properly, an increase in the number of processors may actually re...
Publication date: 2016